Optimizing research with GPUs on Hoffman2

Charles Peterson

👋 Welcome Everyone!

Discover the power of GPU computing to accelerate your research on UCLA’s Hoffman2 cluster! This workshop is designed to guide you through the essentials of GPU utilization, enhancing your projects with cutting-edge computational efficiency.

🔑 Key Topics:

  • Understanding GPU architecture and its benefits
  • Techniques for compiling and optimizing GPU code
  • Hands-on access to Hoffman2’s advanced GPU resources
  • Utilizing Python and R for GPU computing

For suggestions:

📖 Access the Workshop Files

This presentation and accompanying materials are available on 🔗 UCLA OARC GitHub Repository

You can view the slides in:

Each file provides detailed instructions and examples on the various topics covered in this workshop.

Note: 🛠️ This presentation was built using Quarto and RStudio.

Clone this repository on Hoffman2 to access the workshop files:

git clone https://github.com/ucla-oarc-hpc/WS_HPC-GPU.git

GPU Basics

What are GPUs?

Graphics Processing Units (GPUs) were initially developed for processing graphics and visual operations, tasks for which CPUs were too slow. Their architecture allows them to handle massively parallel workloads efficiently.

In the mid-2000s, GPUs began to be used for non-graphical computations. NVIDIA introduced CUDA, a parallel computing platform and programming model that allows non-graphics programs to be compiled for and run on GPUs, spearheading the era of General-Purpose GPU computing (GPGPU).


Applications of GPUs

GPUs are ubiquitous and found in devices ranging from PCs to mobile phones, and gaming consoles like Xbox and PlayStation.

Though initially designed for graphics, GPUs are now used in a wide range of applications.

  • Machine Learning: Training and inference especially in Deep Learning neural networks
  • Large Language Models: Training for NLP models
  • Data Science: Accelerating data processing and analysis
  • High-Performance Computing: Simulations and scientific computing

GPU Performance

The Power of GPUs

The significant speedup offered by GPUs comes from their ability to parallelize operations over thousands of cores, unlike traditional CPUs.
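To make the idea concrete, here is a pure-Python sketch of the data-partitioning pattern: split one big elementwise job into chunks and hand each chunk to a worker, the way a GPU assigns array elements to its cores. (CPython threads will not actually speed up this arithmetic because of the GIL; this only illustrates the partitioning, not real GPU performance.)

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical workload: square every element of a large list.
data = list(range(1_000_000))

# Serial version: one "core" walks the whole array.
serial = [x * x for x in data]

# Parallel version: split the array into chunks and process each chunk
# concurrently, the way a GPU maps elements onto thousands of cores.
def square_chunk(chunk):
    return [x * x for x in chunk]

n_workers = 4
size = len(data) // n_workers
chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    parallel = [y for part in pool.map(square_chunk, chunks) for y in part]
```

Both versions produce identical results; on a GPU each "chunk" would be far finer-grained, down to one element per thread.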

GPU Workflow

GPU considerations

  • Code Optimization
    • Some code is not suitable for GPUs
  • GPU architecture
    • Some code runs more efficiently on some GPUs than on others, or sometimes not at all
  • Overhead
    • Data transfer between CPU and GPU can be costly
  • Memory Management
    • GPU memory is limited and can be a bottleneck
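The overhead point can be captured with a back-of-envelope cost model: offloading pays off only when the compute time saved outweighs the CPU-GPU transfer time. The throughput figures below are rough illustrative assumptions, not measurements of any Hoffman2 GPU.

```python
def offload_pays_off(flops, bytes_moved,
                     cpu_gflops=50.0, gpu_gflops=5000.0, pcie_gbs=16.0):
    """Rough check: does GPU compute time plus PCIe transfer time
    beat CPU compute time?  All rates are illustrative assumptions."""
    cpu_time = flops / (cpu_gflops * 1e9)
    gpu_time = flops / (gpu_gflops * 1e9) + bytes_moved / (pcie_gbs * 1e9)
    return gpu_time < cpu_time

# Large matrix multiply: lots of compute per byte moved -> worth offloading.
n = 8192
print(offload_pays_off(2 * n**3, 3 * n * n * 8))   # True
# Tiny elementwise add: transfer dominates -> keep it on the CPU.
print(offload_pays_off(1e6, 3e6 * 8))              # False
```

The ratio of arithmetic to data moved (arithmetic intensity) is what makes matrix multiplication a classic GPU win and small elementwise jobs a classic GPU loss.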

GPUs on Hoffman2

There are multiple GPU types available in the cluster.

Each GPU has a different compute capability, memory size and clock speed.

| GPU type    | # CUDA cores | VMem  | SGE option               |
|-------------|--------------|-------|--------------------------|
| NVIDIA A100 | 6912         | 80 GB | -l gpu,A100,cuda=1       |
| Tesla V100  | 5120         | 32 GB | -l gpu,V100,cuda=1       |
| RTX 2080 Ti | 4352         | 11 GB | -l gpu,RTX2080Ti,cuda=1  |
| Tesla P4    | 2560         | 8 GB  | -l gpu,P4,cuda=1         |

Interactive job

qrsh -l h_data=40G,h_rt=1:00:00,gpu,A100,cuda=1

Batch submission

Add the following to your job script

#$ -l gpu,A100,cuda=1

Note

If you would like to host GPU nodes on Hoffman2 or get highp access, please contact us!

GPU optimization

Warning

When you use the -l gpu option, it only reserves the GPU for your job.

You will still need to use GPU optimized software and libraries to take advantage of the GPU’s parallel processing power.

The following sections will cover how to compile and run GPU optimized code on Hoffman2.

Compiling GPU Software

CUDA

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) from NVIDIA. It allows developers to write programs that execute on GPUs.

On Hoffman2, you can compile CUDA code by loading the cuda module. This modifies your environment to use the CUDA Toolkit, which provides the libraries and compilers needed to build and run CUDA code.

See all available CUDA versions

modules_lookup -m cuda

Loading the CUDA 11.8 Toolkit

module load cuda/11.8

CUDA libraries

CUDA code example


We will walk through a simple example of CUDA code that performs a 1024x1024 matrix multiplication.

  • Files are in the MatrixMult folder
    • Matrix-cpu.cpp file contains CPU (serial) code
    • Matrix-gpu.cu file contains the CUDA code
    • MatrixMult.job job submission file

Loading modules

module load gcc/10.2.0
module load cuda/12.3

Compiling code

g++ -o Matrix-cpu Matrix-cpu.cpp
nvcc -o Matrix-gpu Matrix-gpu.cu

Submitting job

qsub MatrixMult.job

GPU software

Be on the lookout for GPU optimized software for your research!

Other GPU platforms include:

  • NVIDIA’s HPC SDK (Software Development Kit)
    • C/C++/Fortran compilers, Math libraries, and Open MPI
modules_lookup -m hpcsdk
  • AMD ROCm (Radeon Open Compute)
    • For AMD GPUs
modules_lookup -m amd

Using Python/R for GPU Computing

GPUs for Python and R

There are several Python and R packages that use GPUs for various data-intensive tasks, such as Machine Learning, Deep Learning, and large-scale data processing.

Python:

  • TensorFlow: One of the most widely used libraries for machine learning and deep learning that supports GPUs for acceleration.
  • PyTorch: A popular library for deep learning that features strong GPU acceleration and is favored for its flexibility and speed.
  • cuPy: A library that provides GPU-accelerated equivalents to NumPy functions, facilitating easy transitions from CPU to GPU.
  • RAPIDS: A suite of open-source software libraries built on CUDA-X AI, providing the ability to execute end-to-end data science and analytics pipelines entirely on GPUs.
  • Numba: An open-source JIT compiler that translates a subset of Python and NumPy code into fast machine code, with capabilities for running on GPUs.
  • Dask: A Python library for parallel and distributed computing.
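cuPy's NumPy compatibility is often exploited with the common `xp` idiom: write array code against a module handle so the same function runs on CPU (NumPy) or GPU (CuPy). A minimal sketch, using NumPy here since a GPU is not assumed; the function and data are illustrative:

```python
import numpy as np

def center_columns(xp, data):
    """Subtract each column's mean; `xp` may be numpy or cupy."""
    a = xp.asarray(data, dtype=float)
    return a - a.mean(axis=0)

# On a Hoffman2 GPU node you could pass cupy instead of numpy:
#   import cupy; centered = center_columns(cupy, data)
centered = center_columns(np, [[1.0, 2.0], [3.0, 4.0]])
print(centered)   # [[-1. -1.], [ 1.  1.]]
```

Because CuPy mirrors the NumPy API for operations like these, porting is often just a change of the module handle.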

R:

  • gputools: Provides a variety of GPU-enabled functions, including matrix operations, solving linear equations, and hierarchical clustering.
  • cudaBayesreg: Designed for Bayesian regression modeling on NVIDIA GPUs, using CUDA.
  • gpuR: An R package that interfaces with both OpenCL and CUDA to allow R users to access GPU functions for accelerating matrix algebra and operations.
  • Torch for R: An R machine learning framework based on PyTorch
  • TensorFlow for R: An R interface to a Python build of TensorFlow

TensorFlow and PyTorch

Installing TensorFlow and PyTorch on Hoffman2 is straightforward using the Anaconda package manager. (Check out my Workshop on using Anaconda)

Create a new conda environment with CUDA tools.

export PYTHON_VER=3.9
export CUDA_TK_VER=11.8
module load anaconda3/2023.03
conda create -n tf_torch_gpu python=${PYTHON_VER} cudatoolkit=${CUDA_TK_VER} pandas cudnn -c anaconda -c conda-forge -c nvidia -y
conda activate tf_torch_gpu

Install TensorFlow/PyTorch with GPU support and the NVIDIA libraries

python3 -m pip install tensorflow[and-cuda]==2.14
pip3 install tensorrt tensorrt-bindings tensorrt-libs --extra-index-url https://pypi.nvidia.com
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu${CUDA_TK_VER//./}
pip3 install scikit-learn

Verify the TensorFlow installation. This will only work if you are on a GPU-enabled node.

# TensorFlow Test:
python -c "import tensorflow as tf; gpus = tf.config.list_physical_devices('GPU'); print('TensorFlow is using:', ('GPU: ' + gpus[0].name) if gpus else 'CPU')"

# PyTorch Test:
python -c "import torch; print('PyTorch is using:', ('GPU: ' + torch.cuda.get_device_name(0)) if torch.cuda.is_available() else 'CPU')"

👗 Fashion MNIST

This example focuses on the “Fashion MNIST” dataset, a collection used frequently in machine learning for image recognition tasks.

Approach:

  • We will use TensorFlow to train a neural network model for predicting fashion categories.

Dataset Overview:

  • 📸 Images: 28x28 grayscale images of fashion products.
  • 📊 Categories: 10, with 7,000 images per category.
  • 🧮 Total Images: 70,000.
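Before training, image datasets like this are typically scaled to [0, 1] and the labels one-hot encoded. A sketch of that preprocessing with NumPy, using random stand-in data of the same shape rather than the real Fashion MNIST download:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the real dataset: fake 28x28 grayscale "images"
# with pixel values 0-255 and integer labels 0-9.
images = rng.integers(0, 256, size=(100, 28, 28)).astype("float32")
labels = rng.integers(0, 10, size=100)

# Scale pixels to [0, 1] -- the usual first step before training.
images /= 255.0

# One-hot encode the 10 categories.
one_hot = np.eye(10, dtype="float32")[labels]

print(images.shape, one_hot.shape)   # (100, 28, 28) (100, 10)
```

The GPU and CPU training scripts in the TF-Torch folder consume arrays in exactly this layout.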

Running TensorFlow

Now that we have TensorFlow installed, we can run some examples to test the GPU acceleration.

Files in the TF-Torch folder contain examples of using TensorFlow on Hoffman2.

Get a GPU node

qrsh -l h_data=40G,h_rt=1:00:00,gpu,A100,cuda=1

Set up your TensorFlow environment

module load anaconda3/2023.03
conda activate tf_torch_gpu

Run CPU example

python minst-train-cpu.py

Run GPU example

python minst-train-gpu.py

DNA classification with PyTorch

🧬 DNA Sequence Classification with PyTorch

  • 🧬 Objective: Create a model to classify DNA sequences into ‘gene’ or ‘non-gene’ regions.
  • Gene Regions: Segments of DNA containing codes for protein production.
  • Dataset Creation: Generate random DNA sequences labeled as ‘gene’ or ‘non-gene’.
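The dataset-creation step can be sketched in plain Python; the sequence length, random labels, and integer encoding below are illustrative assumptions, not the workshop's dnatorch.py implementation:

```python
import random

random.seed(42)
BASES = "ACGT"
ENCODE = {b: i for i, b in enumerate(BASES)}  # A->0, C->1, G->2, T->3

def random_sequence(length=50):
    """Generate one random DNA sequence as a string of bases."""
    return "".join(random.choice(BASES) for _ in range(length))

# Toy dataset: (integer-encoded sequence, label) pairs; label 1 = 'gene'.
# Labels are random here; a real script would embed a distinguishing motif.
dataset = [(
    [ENCODE[b] for b in random_sequence()],
    random.randint(0, 1),
) for _ in range(1000)]
```

Integer-encoded sequences like these can be fed directly into a PyTorch embedding layer or one-hot encoded for a convolutional model.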

DNA Illustration

  • 🤖 Model Development: Use PyTorch to build a model predicting the presence of ‘gene’ regions.
  • 🚀 Leveraging GPUs: Utilize the parallel processing power of GPUs for efficient training.

Running PyTorch

With PyTorch installed in the same Anaconda environment, we can now run the DNA classification example.

To run PyTorch on the GPU when one is available

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Force running PyTorch on the CPU

device = torch.device('cpu')

Run example

python dnatorch.py

Rapids for Genomic Data Analysis

Processing large genomic datasets such as VCF files can be computationally intensive and time-consuming. Leveraging GPU acceleration can significantly reduce processing times, allowing for more rapid data analysis and insights.

We will:

  • Apply conditions to filter dataframes based on depth, quality, and allele frequency.
  • Group data by chromosome and calculate mean statistics for depth, quality, and allele frequency.
  • Compare the speed of these operations on GPU versus CPU.
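The filter and group-by steps can be sketched with pandas; cuDF mirrors the pandas API, so essentially the same code runs on the GPU by swapping the import. The column names (DP = depth, QUAL = quality, AF = allele frequency) and thresholds are illustrative assumptions, not the workshop's exact VCF schema:

```python
import pandas as pd

# Toy variant table standing in for parsed VCF records.
df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr2", "chr2", "chr2"],
    "DP":    [35, 12, 60, 48, 9],
    "QUAL":  [80.0, 20.0, 99.0, 60.0, 15.0],
    "AF":    [0.50, 0.02, 0.30, 0.45, 0.01],
})

# 1. Filter on depth, quality, and allele frequency.
filtered = df[(df.DP >= 20) & (df.QUAL >= 30) & (df.AF >= 0.05)]

# 2. Group by chromosome and take mean statistics.
means = filtered.groupby("chrom")[["DP", "QUAL", "AF"]].mean()
print(means)
```

On a GPU node, replacing `import pandas as pd` with `import cudf as pd` runs these same operations on the GPU, which is where the speed comparison in the job scripts comes from.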


Install Rapids

  • RAPIDS: A suite of open-source software libraries and APIs built on CUDA to enable execution of end-to-end data science and analytics pipelines on GPUs.
  • cuDF: Part of the RAPIDS ecosystem, cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
module load anaconda3/2023.03
conda create -n myrapids -c rapidsai -c conda-forge -c nvidia  \
    rapids=24.04 python=3.11 cuda-version=11.8 -y
conda activate myrapids

Running Rapids

In this example, we will use cuDF to load and filter genomic data efficiently using GPU acceleration.

Files in the rapids folder

  • rapids_analysis-gpu.py - GPU version
  • rapids_analysis-cpu.py - CPU version

The rapids_analysis.job file will submit the job to the Hoffman2 cluster.

In this file, the line #$ -l gpu,V100 will submit this job to the V100 GPU nodes.

Running Rapids

qsub rapids_analysis.job

💧 H2O.ai ML Example

  • H2O.ai is an open-source platform for machine learning and AI.
  • We will work through an example from H2o-tutorials.
  • The focus is on the Combined Cycle Power Plant dataset.
  • Objective: Predict the energy output of a Power Plant using temperature, pressure, humidity, and exhaust vacuum values.
  • In this example we will use the R API, but H2O.ai has a Python API as well
  • We will use XGBoost, a popular gradient boosting algorithm, to train the model.

Installing H2O.ai

We will use R and install the H2O.ai package to run the example.

  • Setting up the environment
module load cuda/11.8 
module load gcc/10.2.0
module load R/4.3.0
  • Installing H2O.ai in R
install.packages(c("RCurl", "jsonlite"))
install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")))

Running H2O.ai

In the h2oai folder, the h2oaiXGBoost.R script contains the code to run XGBoost on the Combined Cycle Power Plant dataset.

The h2oML-gpu.job file will submit the job to the Hoffman2 cluster on a GPU node.

qsub h2oML-gpu.job

The h2oML-cpu.job file will submit the job to the Hoffman2 cluster on a CPU node.

qsub h2oML-cpu.job

The h2o.ai functions will automatically detect the GPU and use it for training.

Wrap up

Hoffman2 has the resources and tools to help you leverage the power of GPUs for your research.

Main Takeaways:

  • Use -l gpu option to reserve a GPU node
  • Compile GPU-optimized code with CUDA
  • Use Python and R packages for GPU computing

👏 Thanks for Joining! ❤️

Questions? Comments?